Fine-Tuning Qwen2-VL-7B for Nutrition Table Detection
Fine-tune Qwen2-VL, a Vision-Language Model (VLM), to detect nutrition tables in product images, starting from a zero-shot baseline and ending with LoRA-based experiments.
Project Abstract: This notebook documents the end-to-end process of fine-tuning the Qwen2-VL-7B model for nutrition table detection. Starting from a strong zero-shot baseline (0.590 Mean IoU), I systematically explored three QLoRA fine-tuning strategies, overcoming significant memory and hardware challenges. The best model achieved a Mean IoU of 0.771, a 30.7% relative improvement, demonstrating the effectiveness of parameter-efficient fine-tuning for specialized vision-language tasks.
Table of Contents
- Introduction & Motivation
- Environment & Setup
- Dataset Overview & Visualization
- Understanding the Qwen2-VL Model
- Zero-Shot Baseline Evaluation
- Fine-Tuning Strategy and Data Preparation
- Rationale for Parameter-Efficient Fine-Tuning (PEFT)
- Fine-Tuning Experiments and Training
- Checkpoint Evaluation
- Final Results and Analysis
- Production Deployment: Merging LoRA Adapters
Introduction & Motivation
In this notebook, I fine-tune Qwen2-VL-7B to detect nutrition tables in product images, using a dataset hosted on Hugging Face.
If you are new to this kind of work, check out Daniel Voigt Godoy's book, *A Hands-On Guide to Fine-Tuning Large Language Models with PyTorch and Hugging Face*.
#!pip install torch>=2.1 torchvision>=0.16 torchaudio>=2.1 \
# numpy pillow datasets>=2.20 transformers>=4.42 \
# accelerate>=0.27 trl>=0.9 peft>=0.12 safetensors>=0.4 \
# huggingface_hub>=0.23 tqdm matplotlib \
# bitsandbytes>=0.43 \
# qwen-vl-utils \
# seaborn
# Standard library
import os
import re
import json
from pathlib import Path
from pprint import pprint
import time
import gc
# Third-party
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import ImageDraw, Image, ImageFont
import torch
from torch.utils.data import DataLoader
from torchvision.ops import box_iou
from datasets import load_dataset
from transformers import (
AutoModelForImageTextToText,
AutoProcessor,
BitsAndBytesConfig,
Qwen2VLForConditionalGeneration,
)
from peft import LoraConfig, PeftModel, get_peft_model
from trl import SFTConfig, SFTTrainer
from huggingface_hub import login
from tqdm.auto import tqdm
import bitsandbytes as bnb
from qwen_vl_utils import process_vision_info, vision_process
# import wandb
print("Torch:", torch.__version__, "| CUDA build:", torch.version.cuda, "| CUDA avail:", torch.cuda.is_available())
if torch.cuda.is_available(): print("GPU:", torch.cuda.get_device_name(0))
Torch: 2.4.1+cu121 | CUDA build: 12.1 | CUDA avail: True GPU: NVIDIA A100 80GB PCIe
Optional Settings for an Improved Jupyter Experience
from IPython.display import display, HTML, set_matplotlib_formats
display(HTML("<style>.container { width:100% !important; }</style>"))
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%config InlineBackend.figure_format = 'retina'
# Only show figures from explicit plt.show() calls
%matplotlib inline
plt.ioff()
login(token="YOUR_HF_TOKEN_HERE")
# !pip install hf_transfer
# login(token=os.environ["HUGGINGFACE_TOKEN"])
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
Utility functions
def clear_memory():
# clear the current variables and clean the GPU to free up resources.
# Delete variables if they exist in the current global scope
if 'inputs' in globals(): del globals()['inputs']
if 'model' in globals(): del globals()['model']
if 'processor' in globals(): del globals()['processor']
if 'trainer' in globals(): del globals()['trainer']
if 'peft_model' in globals(): del globals()['peft_model']
if 'bnb_config' in globals(): del globals()['bnb_config']
time.sleep(2)
# Garbage collection and clearing CUDA memory
gc.collect()
time.sleep(2)
torch.cuda.empty_cache()
torch.cuda.synchronize()
time.sleep(2)
gc.collect()
time.sleep(2)
print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
def parse_bounding_boxes(response_text: str) -> list:
"""
Parses a model's text response to find bounding box coordinates.
- Flexibly finds all numbers (int or float) in the text.
- Groups them into bounding boxes of 4.
- Converts them from the Qwen 0-1000 scale to a normalized [0, 1] scale.
- Returns a list of lists, with each inner list being [x_min, y_min, x_max, y_max].
- It intelligently sorts the coordinates to ensure (x_min, y_min) is the top-left corner.
"""
all_numbers_str = re.findall(r'[-+]?\d*\.\d+|\d+', response_text)
if len(all_numbers_str) < 4:
return []
all_numbers = [float(n) for n in all_numbers_str]
num_boxes = len(all_numbers) // 4
parsed_boxes = []
for i in range(num_boxes):
start_index = i * 4
box_nums = all_numbers[start_index : start_index + 4]
c1, c2, c3, c4 = box_nums
x1, y1, x2, y2 = c1 / 1000.0, c2 / 1000.0, c3 / 1000.0, c4 / 1000.0
x_min = min(x1, x2)
y_min = min(y1, y2)
x_max = max(x1, x2)
y_max = max(y1, y2)
parsed_boxes.append([x_min, y_min, x_max, y_max])
return parsed_boxes
# Optional tests function and test cases
def run_parser_test_suite():
"""
Tests the parse_bounding_boxes function against various text inputs.
The expected format is a list of lists: [[x_min, y_min, x_max, y_max], ...],
with all coordinates normalized to the [0, 1] range.
"""
test_cases = {
"official_four_coords": {
"input": "I found two boxes. The first is at 0,12,0,35. The second is 0,67,0,85.",
"expected": [[0.0, 0.012, 0.0, 0.035], [0.0, 0.067, 0.0, 0.085]]
},
"official_two_pairs": {
"input": "The box is at (10, 20), (300, 400)",
"expected": [[0.01, 0.02, 0.3, 0.4]]
},
"three_boxes_two_pairs": {
"input": "(1,2),(3,4) and (5,6),(7,8) also (9,10),(11,12)",
"expected": [[0.001, 0.002, 0.003, 0.004], [0.005, 0.006, 0.007, 0.008], [0.009, 0.01, 0.011, 0.012]]
},
# --- THIS TEST CASE IS NOW CORRECTED ---
"brackets_float": {
"input": "[0.1, 0.2, 0.3, 0.4]",
"expected": [[0.0001, 0.0002, 0.0003, 0.0004]] # One box from four numbers
},
"conversational_text": {
"input": "I think the nutrition table is around 150, 200, 550, 750 on the label.",
"expected": [[0.15, 0.2, 0.55, 0.75]]
},
"desc w/ (int, int..)": {
"input": "bounding_box: (0, 0, 1000, 1000)",
"expected": [[0.0, 0.0, 1.0, 1.0]]
},
"no_box": {
"input": "There is no nutrition table in this image.",
"expected": []
}
}
print("--- Running Final Parser Test Suite ---")
all_passed = True
for name, case in test_cases.items():
result = parse_bounding_boxes(case["input"])
is_correct = True
if len(result) != len(case["expected"]):
is_correct = False
else:
for res_box, exp_box in zip(result, case["expected"]):
if not all(abs(r - e) < 1e-6 for r, e in zip(res_box, exp_box)):
is_correct = False
break
if is_correct:
print(f"PASSED: {name}")
else:
print(f"FAILED: {name} | Got: {result}, Expected: {case['expected']}")
all_passed = False
if all_passed:
print("\nAll tests passed!")
# run_parser_test_suite()
SYSTEM_MESSAGE = (
"You are a vision-language model specializing in nutrition-table detection.\n"
"Detect every nutrition table in the image and respond only with lines of the form:\n"
"nutrition-table<box(x_min, y_min),(x_max, y_max)>\n"
"Coordinates are integers between 0 and 1000 in a normalized coordinate system (x first, then y).\n"
"If multiple tables exist, return each on a separate line. Do not extract or describe text."
)
USER_PROMPT = "Detect all nutrition tables in this image and return the boxes."
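As a quick round-trip check, a response written in the format requested by `SYSTEM_MESSAGE` should decode back to normalized coordinates. This small sketch inlines the same extraction rule that `parse_bounding_boxes` uses (grab all numbers, group into fours, rescale from the 0-1000 grid):

```python
import re

# A response in the format requested by SYSTEM_MESSAGE
response = "nutrition-table<box(150, 200),(550, 750)>"

# Same extraction rule as parse_bounding_boxes: find all numbers,
# group them into fours, rescale from the 0-1000 grid to [0, 1]
nums = [float(n) for n in re.findall(r"[-+]?\d*\.\d+|\d+", response)]
boxes = [[n / 1000.0 for n in nums[i:i + 4]]
         for i in range(0, len(nums) // 4 * 4, 4)]
print(boxes)  # [[0.15, 0.2, 0.55, 0.75]]
```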
def run_inference(example_or_image, *, model=None, processor=None, prompt=None, max_new_tokens=128):
"""
Runs inference on a dataset example (raw or mapped) or a raw PIL image.
"""
mdl = model if model is not None else globals().get("model")
proc = processor if processor is not None else globals().get("processor")
if mdl is None or proc is None:
raise ValueError("Pass `model`/`processor`, or keep globals with those names available.")
# Handle raw dicts with or without 'messages'
if isinstance(example_or_image, dict):
if "messages" in example_or_image:
messages = example_or_image["messages"]
image = example_or_image["image"]
else:
image = example_or_image["image"]
messages = [
{"role": "system", "content": SYSTEM_MESSAGE},
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": prompt or USER_PROMPT},
],
},
]
else:
image = example_or_image
messages = [
{"role": "system", "content": SYSTEM_MESSAGE},
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": prompt or USER_PROMPT},
],
},
]
# Format messages
formatted_messages = []
for message in messages:
role = message.get("role")
content = message.get("content")
if isinstance(content, list) and content and isinstance(content[0], dict) and "type" in content[0]:
formatted_messages.append(message)
continue
text = content if isinstance(content, str) else ""
if role == "user":
formatted_messages.append({
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": text.replace("<|image_1|>", "").strip()},
],
})
else:
formatted_messages.append({
"role": role,
"content": [{"type": "text", "text": text}],
})
text_prompt = proc.tokenizer.apply_chat_template(
formatted_messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = proc(
text=text_prompt,
images=[image],
return_tensors="pt",
padding=True,
).to(mdl.device)
if "pixel_values" in inputs:
inputs["pixel_values"] = inputs["pixel_values"].to(mdl.dtype)
with torch.no_grad():
generated_ids = mdl.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False,
num_beams=1,
)
trimmed_generated_ids = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
# Decode only the NEW tokens
response = proc.batch_decode(trimmed_generated_ids, skip_special_tokens=True)[0]
return response
Dataset Loading & Exploration
In this section, I load the openfoodfacts/nutrition-table-detection dataset. It contains product images, their barcodes, and bounding boxes for the nutrition tables.
# load the dataset into training and evaluation sets
dataset_id = "openfoodfacts/nutrition-table-detection"
dataset_train_raw = load_dataset(dataset_id, split="train")
dataset_test_raw = load_dataset(dataset_id, split="val")
example = dataset_train_raw[657]
pprint(example)
# raw image (scaled down)
_ = plt.figure(figsize=(6,6))
_ = plt.imshow(example["image"])
_ = plt.axis("off")
_ = plt.show()
{'height': 3053,
'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2866x3053 at 0x70CB0241E2C0>,
'image_id': '3257983357752_2',
'meta': {'barcode': '3257983357752',
'image_url': 'https://static.openfoodfacts.org/images/products/325/798/335/7752/2.jpg',
'off_image_id': '2'},
'objects': {'bbox': [[0.6387160420417786,
0.22644801437854767,
0.8336063027381897,
0.5094208121299744],
[0.5224369764328003,
0.8314724564552307,
0.6226662397384644,
0.8967201709747314],
[0.521781861782074,
0.8963712453842163,
0.6233212947845459,
0.9637125134468079]],
'category_id': [0, 2, 2],
'category_name': ['nutrition-table',
'nutrition-table-small-energy',
'nutrition-table-small-energy']},
'width': 2866}
Dataset Overview & Visualization: Nutrition Table Detection
This project utilizes the openfoodfacts/nutrition-table-detection dataset, which is available on Hugging Face. The dataset was created by Open Food Facts and was used to train their own production model for detecting nutrition tables, providing a robust, real-world foundation for this fine-tuning task.
For our purposes, we will focus on the following key fields from each sample:
- image: The input image loaded as a PIL object.
- width & height: The original dimensions of the image in pixels. These are essential for visualizing the bounding boxes.
- objects: A dictionary containing the ground-truth annotations for the image.
- bbox: A list containing the bounding box coordinates.
- category_name: A list containing the object's class name, the main one being 'nutrition-table'.
Normalized Bounding Box Coordinates
The bounding box coordinates are normalized, meaning their values range from 0 to 1. The coordinates are provided in the format [y_min, x_min, y_max, x_max].
This is a standard practice in computer vision because it makes the model's training process independent of the input image's resolution. To properly visualize these normalized coordinates on an image, we must scale them back to pixel values using the image's original width and height:
- absolute_x = normalized_x * image_width
- absolute_y = normalized_y * image_height
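Applied in code, the scaling looks like this (a minimal sketch; the example box and image size below are hypothetical, and the helper name is my own):

```python
def denormalize_bbox(bbox, width, height):
    """Convert a normalized [y_min, x_min, y_max, x_max] box (this
    dataset's order) to absolute pixel coordinates [x0, y0, x1, y1]."""
    y_min, x_min, y_max, x_max = bbox
    return [
        int(x_min * width), int(y_min * height),
        int(x_max * width), int(y_max * height),
    ]

# Hypothetical example: a 2866x3053 image with one normalized box
print(denormalize_bbox([0.226, 0.639, 0.509, 0.834], 2866, 3053))
```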
def show_bboxes(example, show_labels=True, figsize=(8, 8)):
"""
Show all bounding boxes for a single HF example dict from
openfoodfacts/nutrition-table-detection.
Args:
example: dict with keys ["image", "objects", "image_id", ...]
show_labels: draw 1..n in top-left inside each box
figsize: matplotlib figure size
"""
img = example["image"].copy()
w, h = img.size
draw = ImageDraw.Draw(img)
# scale line width & font for visibility on big images
lw = max(2, h // 400)
fs = max(18, h // 30)
try:
font = ImageFont.truetype("DejaVuSans-Bold.ttf", fs)
except:
font = ImageFont.load_default()
for i, bb in enumerate(example["objects"]["bbox"], start=1):
# dataset format: [y_min, x_min, y_max, x_max] normalized
y_min, x_min, y_max, x_max = map(float, bb)
x0, y0 = int(x_min * w), int(y_min * h)
x1, y1 = int(x_max * w), int(y_max * h)
draw.rectangle([x0, y0, x1, y1], outline="red", width=lw)
if show_labels:
draw.text((x0 + 5, y0 + 5), str(i), fill="red", font=font)
plt.figure(figsize=figsize)
plt.imshow(img)
title = f"Image ID: {example.get('image_id', 'unknown')} β’ {len(example['objects']['bbox'])} boxes"
plt.title(title)
plt.axis("off")
plt.show()
show_bboxes(example)
Analysis of Data Distributions
After visualizing the dataset, I drew several key conclusions that directly influenced my modeling and memory management strategy.
Key Observations & Implications
- Variable Image Resolutions: The histograms show a wide distribution of image widths and heights, with no single standard size. While the Qwen2-VL architecture is designed to handle variable resolutions by breaking images into patches, this variation presents a significant memory challenge. A very large image can produce a long sequence of visual tokens, drastically increasing the VRAM required for even a single sample (batch_size=1). This observation validated my decision to impose a MAX_PIXELS limit as a crucial memory optimization.
- Small Bounding Boxes: The bounding boxes for nutrition tables are typically small relative to the overall image dimensions. This suggests the model needs to be effective at identifying small features within a larger context.
- Handling Multiple Detections: While most images in this dataset contain a single nutrition table, a robust evaluation plan must account for cases with multiple ground-truth boxes or multiple model predictions. My approach for calculating the Mean IoU is to match each ground-truth box to the prediction with the highest overlap. This ensures a fair evaluation, even in complex scenarios.
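The matching rule just described can be illustrated with a small pure-Python sketch (the boxes below are made up; the actual evaluation code uses torchvision's `box_iou` for the same computation):

```python
def iou(a, b):
    """IoU of two boxes in [x_min, y_min, x_max, y_max] format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def best_iou_per_gt(preds, gts):
    """For each ground-truth box, keep the IoU of its best-matching
    prediction (0.0 when there are no predictions)."""
    return [max((iou(p, g) for p in preds), default=0.0) for g in gts]

# Hypothetical boxes: one good match plus one stray prediction
preds = [[0.10, 0.10, 0.50, 0.50], [0.60, 0.60, 0.90, 0.90]]
gts = [[0.12, 0.12, 0.48, 0.52]]
print(best_iou_per_gt(preds, gts))  # one value per GT box
```

The stray prediction does not penalize the Mean IoU here; it only counts against precision in the thresholded metrics.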
### get the histogram of the image sizes
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
def build_image_stats(ds, split_name):
widths, heights, bbox_counts, unique_categories, categories = [], [], [], [], []
for row in ds:
w, h = row["image"].size
widths.append(w)
heights.append(h)
names = row["objects"].get("category_name") or ["unknown"]
bbox_counts.append(len(names))
unique_categories.append(len(set(names)))
categories.append(", ".join(names))
return pd.DataFrame({
"width": widths,
"height": heights,
"bbox_count": bbox_counts,
"unique_categories": unique_categories,
"category": categories,
"split": split_name,
})
df_train = build_image_stats(dataset_train_raw, "train")
df_eval = build_image_stats(dataset_test_raw, "eval")
stats_df = pd.concat([df_train, df_eval], axis=0)
sns.set_theme(style="whitegrid")
fig, axes = plt.subplots(1, 3, figsize=(18, 4))
_ = sns.histplot(data=stats_df, x="width", hue="split", stat="density", ax=axes[0], bins=30)
_ = axes[0].set_title("Image Width")
_ = sns.histplot(data=stats_df, x="height", hue="split", stat="density", ax=axes[1], bins=30)
_ = axes[1].set_title("Image Height")
# sns.histplot(data=stats_df, x="bbox_count", hue="split", discrete=True, ax=axes[2])
_ = sns.histplot(data=stats_df, x="bbox_count", hue="split", ax=axes[2])
_ = axes[2].set_title("# Bounding Boxes per Image")
fig.tight_layout()
_ = plt.figure(figsize=(8,4))
# sns.countplot(data=stats_df, x="unique_categories", hue="split", discrete=True)
_ = sns.countplot(data=stats_df, x="unique_categories", hue="split")
_ = plt.title("Unique Categories per Image")
plt.show()
_ = plt.figure(figsize=(10,4))
_ = sns.countplot(data=stats_df, x="category", order=stats_df["category"].value_counts().index)
_ = plt.xticks(rotation=45, ha="right")
_ = plt.title("Category Frequency")
plt.tight_layout()
plt.show()
Understanding the Qwen2-VL Model
Before using the model, it's important to understand its core components and data requirements.
Architecture: The model consists of a Vision Encoder that processes image patches and a Large Language Model (LLM) for text; the encoder's outputs are projected into visual tokens that the LLM attends to alongside the text tokens, allowing it to "see" the visual information. It uses 2D Rotary Position Embeddings (RoPE) in the vision encoder to effectively understand the spatial relationships between image patches.
- The Processor: The Hugging Face processor is a critical utility that bundles all necessary preprocessing. It applies a chat template to structure the conversation, tokenizes the text, and performs "patch-ification" to convert images into a sequence of visual tokens.
- Expected Bounding Box Format: A key detail from the official Qwen-VL paper is that the model expects bounding box coordinates scaled to an integer grid of 1000x1000. My data preparation pipeline converts the dataset's normalized [0, 1] coordinates into the required format: nutrition-table<box(x1, y1),(x2, y2)>.
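The coordinate conversion just described can be sketched as a small helper (my own illustrative function, assuming the dataset's [y_min, x_min, y_max, x_max] normalized order):

```python
def to_qwen_box_string(bbox, label="nutrition-table"):
    """Convert one normalized dataset box [y_min, x_min, y_max, x_max]
    to Qwen-VL's 0-1000 integer grid: label<box(x1, y1),(x2, y2)>."""
    y_min, x_min, y_max, x_max = bbox
    x1, y1 = round(x_min * 1000), round(y_min * 1000)
    x2, y2 = round(x_max * 1000), round(y_max * 1000)
    return f"{label}<box({x1}, {y1}),({x2}, {y2})>"

# Hypothetical normalized box from the dataset
print(to_qwen_box_string([0.226, 0.639, 0.509, 0.834]))
```

Note the swap from the dataset's y-first ordering to Qwen's x-first pairs.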
The Processor: A Unified Preprocessing Pipeline
The Hugging Face processor for Qwen2-VL is a critical utility that bundles all necessary preprocessing steps. It's more than just a tokenizer; it's a complete data preparation tool.
- Chat Template Application: The process begins with the chat template. When given a conversational input (e.g., a user prompt with text and images), the processor's apply_chat_template function formats it into a single, structured string. It inserts control tokens like <|im_start|>user to manage turns and uses <img>...</img> as placeholders for images.
- Vision Processing: For each image, the processor calls an internal function similar to process_vision_info. This function performs several key operations:
  - It resizes and normalizes the image to the expected dimensions and pixel value range.
  - It performs "patch-ification," dicing the image into a sequence of smaller, fixed-size patches. These patches are the visual equivalent of text tokens.
  - The final output is a pixel_values tensor, ready for the Vision Encoder.
- Text Tokenization: The formatted prompt string (with image placeholders) is passed to the text tokenizer, which converts it into numerical input_ids.
By handling these steps, the processor outputs a dictionary containing the input_ids, pixel_values, and attention_mask needed to feed the model.
Model Architecture and Forward Pass
The Qwen2-VL architecture is designed to fuse these two modalities:
- The Vision Encoder, a Transformer-based network, processes the image patches to extract high-level visual features.
- The LLM processes the text tokens.
- The LLM's attention acts as the bridge, allowing the model to "look at" the relevant visual tokens from the encoder at each step of text generation.
For a prompt with multiple images, such as <img>img1.jpg</img>Describe this. Now look at <img>img2.jpg</img> and compare., the model processes each image's patches separately. It uses techniques like "forbidden attention" to ensure that when generating text about the first image, it doesn't "see" the features from the second, maintaining context.
Positional Awareness: 2D RoPE
A key innovation in modern Transformers, including Qwen2-VL's vision encoder, is the use of 2D Rotary Position Embedding (RoPE).
What is it? Traditional position embeddings add a vector to each token to give it a sense of its absolute location (e.g., "this is patch #5"). RoPE, however, is a more elegant solution that rotates each patch's embedding vector by an angle proportional to its (x, y) coordinates.
Why is it better? This rotational method inherently encodes the relative positions between patches directly into the self-attention calculation. The model doesn't just know where a patch is; it has a built-in, efficient way to understand how far apart patch A is from patch B, both horizontally and vertically. This is crucial for vision tasks, as it helps the model understand the spatial relationships that form objects and scenes without needing extra learnable parameters for position.
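To build intuition, here is a toy sketch of the rotation idea. It is illustrative only: the angle schedule, dimensions, and helper names are invented and do not match Qwen2-VL's actual implementation.

```python
import math

def rotate_pair(pair, angle):
    """Rotate one (even, odd) embedding pair by the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return (pair[0] * c - pair[1] * s, pair[0] * s + pair[1] * c)

def toy_rope_2d(emb, x, y):
    """Toy 2D RoPE sketch: the first half of the embedding is rotated
    by angles proportional to the patch's x coordinate, the second
    half by its y coordinate. Invented angle schedule for illustration."""
    half = len(emb) // 2
    out = []
    for i in range(0, half, 2):
        out.extend(rotate_pair(emb[i:i + 2], x * 0.01 * (i + 1)))
    for i in range(half, len(emb), 2):
        out.extend(rotate_pair(emb[i:i + 2], y * 0.01 * (i + 1)))
    return out

emb = [1.0] * 8
a = toy_rope_2d(emb, x=3, y=5)
# A pure rotation preserves the embedding's norm; position is encoded
# only as an angle, so relative offsets between patches fall out of the
# dot products used in self-attention.
print(sum(v * v for v in a))  # stays 8.0 (up to float error)
```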
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForImageTextToText.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct",
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", trust_remote_code=True)
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Baseline Model Memory Footprint
Loading the base Qwen2-VL-7B model reveals its resource needs before any fine-tuning. Note that the cell above loads it with 4-bit quantization; the weight-size estimate below assumes 16-bit (2-byte) weights.
- Parameters (4.7B): In bfloat16, the model's weights require ~8.74 GB of VRAM.
- CUDA Allocated (9.02 GB): This is the active memory holding the model's weights.
- CUDA Reserved (13.73 GB): This is the total memory pool PyTorch has allocated from the GPU for current and future operations (like activations during inference).
This initial ~14 GB footprint confirms that full fine-tuning is challenging even on high-end GPUs like the A100 40GB, making parameter-efficient techniques like LoRA essential.
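A back-of-the-envelope comparison makes the parameter-efficiency case concrete. This is a sketch: the 3584-dimensional projection and rank are illustrative numbers, not exact Qwen2-VL figures.

```python
def lora_param_count(d_in, d_out, rank):
    """A rank-r LoRA adapter on a d_in x d_out weight adds two small
    matrices: A (d_in x r) and B (r x d_out)."""
    return d_in * rank + rank * d_out

# Illustrative numbers: one 3584 x 3584 attention projection
full = 3584 * 3584
lora = lora_param_count(3584, 3584, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Training updates (gradients plus optimizer state) scale with the trainable parameter count, which is why adapting well under 1% of each targeted matrix fits where full fine-tuning does not.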
def print_model_memory(model):
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_gb = total_params * 2 / 1024**3 # bfloat16 weights = 2 bytes
print(f"Parameters: {total_params:,} (~{total_gb:.2f} GB)")
print(f"Trainable parameters: {trainable_params:,}")
if torch.cuda.is_available():
print(f"CUDA memory allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
print(f"CUDA memory reserved: {torch.cuda.memory_reserved()/1024**3:.2f} GB")
print_model_memory(model)
Parameters: 4,691,876,352 (~8.74 GB) Trainable parameters: 1,091,870,720 CUDA memory allocated: 5.53 GB CUDA memory reserved: 7.32 GB
def evaluate_vlm(model, processor, dataset, max_samples=None, iou_threshold=0.5, max_new_tokens=128):
"""
Evaluates a vision-language model on object detection.
Calculates:
1. True Mean IoU: Average of best IoU for each GT box (no threshold)
- Each GT box is matched to its best prediction
- Unmatched GT boxes contribute 0
- This is the TRUE mean across all GT boxes
2. Threshold-based metrics (precision, recall, F1):
- Uses iou_threshold for counting TP/FP/FN
- Greedy matching above threshold
Args:
model: VLM model
processor: Model processor
dataset: Test dataset (list or HF dataset)
max_samples: Optional limit on samples
iou_threshold: Threshold for precision/recall/F1 (NOT used for mean IoU)
max_new_tokens: Max tokens for generation
Returns:
dict with mean_gt_iou, precision, recall, f1, samples_evaluated
"""
model.eval()
total_iou_sum = 0.0
total_gt_boxes = 0
tp, fp, fn = 0, 0, 0
samples = dataset[:max_samples] if max_samples else dataset
for example in samples:
response = run_inference(
example,
model=model,
processor=processor,
max_new_tokens=max_new_tokens
)
pred_boxes = parse_bounding_boxes(response)
gt_boxes = example["objects"]["bbox"]
# Increment total ground truth boxes
total_gt_boxes += len(gt_boxes)
if not pred_boxes or not gt_boxes:
if not pred_boxes:
fn += len(gt_boxes) # Missed all GT boxes
if not gt_boxes:
fp += len(pred_boxes) # All predictions are false positives
continue
pred_tensor = torch.tensor(pred_boxes, dtype=torch.float32)
gt_tensor = torch.tensor(gt_boxes, dtype=torch.float32)[:, [1, 0, 3, 2]]
iou_matrix = box_iou(pred_tensor, gt_tensor) # [num_pred, num_gt]
# --- 1. True Mean IoU Calculation (No Threshold) ---
# For each GT box, find the IoU of its best-matching prediction.
# If a GT box has no match, its best IoU is 0.
if iou_matrix.numel() > 0:
best_ious_for_gt, _ = iou_matrix.max(dim=0) # Best pred for each GT
total_iou_sum += best_ious_for_gt.sum().item()
# else: no predictions, all GTs contribute 0 (already counted in total_gt_boxes)
# --- 2. Precision/Recall/F1 Calculation (With Threshold) ---
# Use greedy matching to find true positives above threshold
all_pairs = sorted(
[(iou_matrix[p, g].item(), p, g)
for p in range(iou_matrix.shape[0])
for g in range(iou_matrix.shape[1])],
reverse=True
)
matched_preds = set()
matched_gts = set()
for iou, p, g in all_pairs:
if iou < iou_threshold: # β Threshold ONLY affects TP/FP/FN
break
if p in matched_preds or g in matched_gts:
continue
matched_preds.add(p)
matched_gts.add(g)
tp += len(matched_preds)
fp += len(pred_boxes) - len(matched_preds)
fn += len(gt_boxes) - len(matched_preds)
# Final calculations
mean_iou = total_iou_sum / total_gt_boxes if total_gt_boxes else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
return {
'mean_gt_iou': mean_iou,
f'precision@{iou_threshold:.2f}': precision,
f'recall@{iou_threshold:.2f}': recall,
f'f1@{iou_threshold:.2f}': f1,
'samples_evaluated': len(samples),
}
Zero-Shot Baseline Evaluation
My initial tests with a simple prompt confirmed the model's default behavior is to perform Optical Character Recognition (OCR). To get a true detection baseline, I had to engineer a more effective prompt to override this behavior.
Crafting the Final Prompt
The final prompt was designed to be highly explicit, aligning with the model's training data:
- It defines the task ("Detect all...").
- It specifies the exact output format (
"nutrition_label<box...>") and coordinate system ("...on a 1000x1000 canvas"). - It includes a negative constraint to prevent OCR ("Do not extract or describe any text...").
Final Baseline Results
Using this engineered prompt, I ran the evaluation on the entire test set of 123 samples to get the final, official baseline metrics.
- Mean IoU: 0.433
- F1-Score (@0.50 IoU): 0.593
- Precision (@0.50 IoU): 0.610
- Recall (@0.50 IoU): 0.577
This proves that while the model can be guided to understand the task, it lacks the specialized ability to perform it accurately, justifying the need for fine-tuning.
Before fine-tuning, I established a zero-shot baseline to quantify the pre-trained model's performance. This provides a clear, numerical benchmark to measure the impact of my fine-tuning efforts.
baseline_metrics = evaluate_vlm(model, processor, dataset_test_raw, max_samples=None, iou_threshold=0.5)
print(baseline_metrics)
# sanity checks below
# run_inference(example)
# from itertools import islice
# for idx, example in enumerate(islice(dataset_test_raw, 10)):
# response = run_inference(example, max_new_tokens=256)
# pred_boxes = parse_bounding_boxes(response)
# gt_boxes = example["objects"]["bbox"]
# print(f"\nSample {idx}")
# print("Raw response:")
# print(response)
# print("Decoded predicted boxes:", pred_boxes)
# print("Ground-truth boxes:", gt_boxes)
{'mean_gt_iou': 0.43347253386790935, 'precision@0.50': 0.6097560975609756, 'recall@0.50': 0.5769230769230769, 'f1@0.50': 0.5928853754940712, 'samples_evaluated': 123}
# def iou_debug(model, processor, dataset, num_samples=5):
# samples = islice(dataset, num_samples)
# for i, example in enumerate(samples):
# response = run_inference(example, max_new_tokens=256)
# preds = parse_bounding_boxes(response)
# gts = example["objects"]["bbox"]
# if preds:
# gt = torch.tensor(gts, dtype=torch.float32)[:, [1,0,3,2]]
# pr = torch.tensor(preds, dtype=torch.float32)[:, [1,0,3,2]]
# ious = box_iou(gt, pr).max(dim=1).values.tolist()
# else:
# ious = [0.0] * len(gts)
# print(f"Sample {i} IoUs:", ious)
# iou_debug(model, processor, dataset_test_raw, num_samples=5)
Qualitative Analysis of Baseline Performance
To provide a visual understanding of the baseline performance, I overlaid the model's predicted bounding box (in red) on top of the ground-truth box (in green) for a sample image.
As shown, while the model correctly identifies the general region of the nutrition table, it lacks the precision needed for a practical application. The low IoU score for this sample visually corresponds to the significant misalignment between the two boxes. This qualitative result reinforces the need for fine-tuning to improve the model's localization accuracy.
def visualize_prediction(example, response, title="Prediction vs. Ground Truth"):
image = example["image"].copy()
draw = ImageDraw.Draw(image)
w, h = image.size
# Ground truth boxes come as [ymin, xmin, ymax, xmax]
for y_min, x_min, y_max, x_max in example["objects"]["bbox"]:
draw.rectangle(
[(x_min * w, y_min * h), (x_max * w, y_max * h)],
outline="lime",
width=4,
)
# Predictions from parse_bounding_boxes are [x_min, y_min, x_max, y_max]
for x_min, y_min, x_max, y_max in parse_bounding_boxes(response):
draw.rectangle(
[(x_min * w, y_min * h), (x_max * w, y_max * h)],
outline="red",
width=4,
)
plt.figure(figsize=(8, 8))
plt.imshow(image)
plt.title(title)
plt.axis("off")
plt.show()
# sample = dataset_test_raw[0]
sample = dataset_train_raw[657]
response = run_inference(sample, max_new_tokens=256)
visualize_prediction(sample, response)
Fine-Tuning Strategy and Data Preparation
With a clear baseline established, the next step is to fine-tune the model to improve its accuracy. This section outlines my strategy for training and the data preparation required.
Training Objective vs. Evaluation Metric
A key decision in this project is to separate the training objective from the evaluation metric.
- Training Objective (Cross-Entropy Loss): The model is trained to minimize cross-entropy loss, which measures the accuracy of token-by-token text prediction. It is a differentiable function, which is essential for backpropagation.
- Limitation: It is strict on syntax. The model is penalized for any textual deviation from the ground truth, even if the meaning (i.e., the bounding box coordinates) is identical.
- Evaluation Metric (Mean IoU): To measure true task success, I use Mean IoU, which calculates the geometric overlap between the predicted and ground-truth boxes. It is a direct measure of geometric accuracy.
My approach is to train with cross-entropy loss but select the best checkpoint based on the highest Mean IoU on the validation set. This aligns the final model with the true task goal and helps monitor for overfitting.
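Since Mean IoU drives checkpoint selection, it is worth being concrete about the metric. Below is a minimal reference implementation for two axis-aligned boxes in [x_min, y_min, x_max, y_max] format; the evaluation code later uses torchvision's batched `box_iou`, but the underlying math is the same:

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as [x_min, y_min, x_max, y_max]."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou_xyxy([0, 0, 2, 2], [1, 1, 3, 3]))  # 1 / 7 ≈ 0.143
```

Note that IoU is not differentiable in a useful way when boxes are predicted as text tokens, which is exactly why it serves as the selection metric rather than the training loss.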
Fine-Tuning Experiments¶
I will explore two LoRA strategies to determine the most effective fine-tuning approach:
- Language-Only LoRA: Adapts only the LLM to better interpret the visual features.
- Vision+Language LoRA: Adapts both the vision encoder and the LLM to adapt and refine the visual features themselves.
Final Training Sample Structure¶
The code below shows the final data structure that will be fed into the trainer. It combines the image, the engineered prompt, and the ground-truth assistant response with coordinates scaled to the required 1000x1000 format.
[
{
"role": "user",
"content": [
{ "type": "image", "image_url": "path/to/image.jpg" },
      { "type": "text", "text": "Detect all nutrition label regions in this image. Respond with their bounding boxes using the format \"nutrition_label<box(x_min, y_min),(x_max, y_max)>\" on a 1000x1000 canvas. If there are multiple labels, return all of them on separate lines. Do not extract or describe any text; only detect and localize the label areas." }
]
},
{
"role": "assistant",
"content": "nutrition-table<box(250, 300),(450, 500)>" # Example scaled coordinates
}
]
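For illustration, the assistant's box string can be converted back to normalized coordinates with a small regex parser. This is a hypothetical sketch of what a helper like `parse_bounding_boxes` does; the actual implementation in this notebook may differ in details such as handling malformed output:

```python
import re

def parse_boxes_1000(text):
    """Extract (x1, y1),(x2, y2) pairs from "<box(x1, y1),(x2, y2)>"-style
    output and rescale from the 1000x1000 canvas to the 0-1 range.
    Illustrative sketch only, not the notebook's actual parser."""
    pattern = r"\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)"
    boxes = []
    for m in re.finditer(pattern, text):
        x1, y1, x2, y2 = (int(v) / 1000.0 for v in m.groups())
        boxes.append([x1, y1, x2, y2])
    return boxes

print(parse_boxes_1000("nutrition-table<box(250, 300),(450, 500)>"))
# [[0.25, 0.3, 0.45, 0.5]]
```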
# Reset GPU memory before (re)loading the base model + LoRA adapters
clear_memory()
GPU allocated memory: 0.00 GB GPU reserved memory: 0.00 GB
!nvidia-smi
Fri Oct 10 14:54:51 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:41:00.0 Off | 0 |
| N/A 34C P0 66W / 300W | 423MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
Rationale for Parameter-Efficient Fine-Tuning (PEFT)¶
Fine-tuning all 7 billion parameters of the Qwen2-VL model is not only impractical from a hardware perspective but also often suboptimal for performance. It risks catastrophic forgetting, where the model loses its powerful, general-purpose abilities, and can quickly overfit to a small dataset.
Instead, I'm using Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA). This allows me to adapt the model by training less than 0.1% of its total parameters, preserving its core knowledge while teaching it our specific task.
Why Full Fine-Tuning is Infeasible on an A100 40GB GPU¶
A quick calculation demonstrates the memory constraints. For a 7-billion-parameter model, a full fine-tuning process requires VRAM for more than just the model weights:
- Model Weights (16-bit): 7B params Γ 2 bytes/param β 14 GB
- Gradients (16-bit): 7B params Γ 2 bytes/param β 14 GB
- Optimizer States (AdamW): 7B params Γ 8 bytes/param (for 32-bit moments) β 56 GB
The total, ~84 GB, exceeds the 40 GB (and even the 80 GB) capacity of an A100 GPU before even accounting for the memory needed for activations, which is where the image data comes in. This makes full fine-tuning impossible here.
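The arithmetic above can be sanity-checked in a couple of lines:

```python
# Back-of-envelope VRAM estimate for full fine-tuning of a 7B-parameter model.
params = 7e9
weights_gb = params * 2 / 1e9    # bf16 weights: 2 bytes/param
grads_gb = params * 2 / 1e9      # bf16 gradients: 2 bytes/param
optimizer_gb = params * 8 / 1e9  # AdamW fp32 first + second moments: 4 + 4 bytes/param
total_gb = weights_gb + grads_gb + optimizer_gb
print(f"~{total_gb:.0f} GB before activations")  # ~84 GB
```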
My Multi-Faceted Memory Optimization Strategy¶
To solve this, I implemented a multi-faceted strategy where each component addresses a different part of the memory problem:
- LoRA & 8-bit Quantization: This is the primary solution. By freezing the original weights and training only small LoRA adapters with an 8-bit optimizer (paged_adamw_8bit), I drastically reduce the memory needed for gradients and optimizer states from >70 GB to just a few hundred megabytes.
- MAX_PIXELS Image Resizing: This addresses the activation memory. Even with LoRA, processing very high-resolution images can create large activation maps that cause out-of-memory (OOM) errors. By setting a maximum pixel count, I ensure that the memory required for the forward and backward passes remains within the GPU's limits, even for a batch_size=1.
- Gradient Checkpointing & Accumulation: These techniques are the final polish. Gradient checkpointing trades compute time for memory, and accumulating gradients over 4 steps allows me to simulate a larger, more stable effective batch size of 4 without the associated memory cost.
Together, these techniques target each of the main memory bottlenecks in VLM training and make the fine-tuning runs feasible.
Pre-processing Strategy: Handling Variable Image Resolutions¶
My analysis of the dataset revealed a wide distribution of image dimensions, with a long tail of very high-resolution images.
These large outlier images can cause out-of-memory (OOM) errors during the initial data loading phase (dataset.map()), even before the trainer's optimizations are applied.
To solve this, I've implemented a two-stage resizing strategy:
- Pre-emptive Resizing (Safety Net): Inside my create_chat_format function, I first cap the size of every image by ensuring its longest side does not exceed 1024 pixels. I chose 1024 as a balance between preserving as much visual detail as possible for the model to learn from and staying small enough to avoid OOM errors on the A100 40GB during data preparation.
- Final Resizing (MAX_PIXELS): After this initial safety check, the trainer's vision_processor takes over and applies the final MAX_PIXELS = 470,400 constraint. This ensures every image fed into a training batch has a consistent memory footprint.
This approach allows me to retain valuable detail from larger images while guaranteeing that the training process remains stable and within my VRAM budget.
DOWNSIZE = True
def create_chat_format(sample):
"""
Converts a sample from the OpenFoodFacts dataset to the Qwen2-VL chat format.
*** This version correctly normalizes bounding box coordinates to a 0-1000 scale. ***
"""
assistant_response = ""
objects = sample["objects"]
if DOWNSIZE:
max_long_side = 1024
img = sample["image"].copy()
img.thumbnail((max_long_side, max_long_side), Image.Resampling.LANCZOS)
sample["image"] = img
for i in range(len(objects["bbox"])):
category = objects["category_name"][i]
box = objects["bbox"][i]
y_min_norm, x_min_norm, y_max_norm, x_max_norm = box
x_min = int(x_min_norm * 1000)
y_min = int(y_min_norm * 1000)
x_max = int(x_max_norm * 1000)
y_max = int(y_max_norm * 1000)
assistant_response += (
f"<|object_ref_start|>{category}<|object_ref_end|>"
f"<|box_start|>({x_min},{y_min}),({x_max},{y_max})<|box_end|> "
)
messages = [
{"role": "system", "content": SYSTEM_MESSAGE},
{
"role": "user",
"content": [
{"type": "image", "image": sample["image"]},
{"type": "text", "text": USER_PROMPT},
],
},
{"role": "assistant", "content": assistant_response.strip()},
]
return {"image": sample["image"], "messages": messages}
print("Formatting training dataset...")
train_dataset = [create_chat_format(sample) for sample in dataset_train_raw]
print("Formatting evaluation dataset...")
eval_dataset = [create_chat_format(sample) for sample in dataset_test_raw]
print(f"✅ Datasets formatted: {len(train_dataset)} train, {len(eval_dataset)} eval")
Formatting training dataset... Formatting evaluation dataset... ✅ Datasets formatted: 1083 train, 123 eval
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_math_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
print('✅ Flash Attention kernels enabled (flash_sdp).')
✅ Flash Attention kernels enabled (flash_sdp).
# ----------------------------------------------------------------------------------
# CRITICAL MEMORY FIX: Set MAX_PIXELS to constrain activation memory
# ----------------------------------------------------------------------------------
# The Qwen2-VL processor converts each image into a grid of patches. The total
# number of patches is determined by the image's resolution. Without a cap,
# high-resolution images can create an extremely large number of patches,
# leading to out-of-memory errors from the activation maps during the forward pass.
#
# By setting MAX_PIXELS, we cap the total size of the feature map, which is the
# primary lever for controlling VRAM usage from image data. This provides a
# massive memory saving (~8-9 GB) compared to using original resolutions.
#
# A value of 470,400 (600 * 28 * 28) was chosen as a conservative but effective
# setting for the A100 GPU.
# ----------------------------------------------------------------------------------
from qwen_vl_utils import process_vision_info, vision_process

vision_process.MAX_PIXELS = 600 * 28 * 28
print(f"✅ MAX_PIXELS set to: {vision_process.MAX_PIXELS:,} pixels to manage VRAM.")
# Verify MAX_PIXELS is set
print(f"MAX_PIXELS: {vision_process.MAX_PIXELS:,}")
✅ MAX_PIXELS set to: 470,400 pixels to manage VRAM. MAX_PIXELS: 470,400
Fine-Tuning Experiments and Training¶
Now I'll prepare the model for fine-tuning. This involves loading the model with 4-bit quantization to manage memory and then applying the LoRA configuration.
Deconstructing the QLoRA Configuration¶
The BitsAndBytesConfig is the core of QLoRA. Here's what the key choices mean:
- load_in_4bit=True: This instructs the library to load the large, frozen base model with its weights quantized to 4 bits, which is the primary source of memory savings.
- bnb_4bit_quant_type="nf4": I use the "NormalFloat 4-bit" (NF4) data type because it is specifically designed for the bell-curve distribution of neural network weights, offering better precision than standard 4-bit floats.
- bnb_4bit_compute_dtype=torch.bfloat16: This is a critical performance setting. It tells the model to de-quantize the 4-bit weights to 16-bit bfloat16 for the actual matrix multiplications. GPUs have specialized hardware (Tensor Cores) optimized for 16-bit math, which provides a massive speedup.
clear_memory()
model_id = "Qwen/Qwen2-VL-7B-Instruct"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForImageTextToText.from_pretrained(
model_id,
trust_remote_code=True,
quantization_config=bnb_config,
device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
print("✅ Vision-Language model and processor loaded successfully!")
GPU allocated memory: 5.55 GB GPU reserved memory: 18.56 GB
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
✅ Vision-Language model and processor loaded successfully!
Debugging an Out-of-Memory Error During Evaluation¶
During my initial training run, I encountered an out-of-memory (OOM) error at the end of the first epoch, specifically when the validation step began.
- Problem Diagnosis: The training itself was memory-stable, but during evaluation, the model would sometimes fail to generate an end-of-sequence token and produce an extremely long, unconstrained output. When the trainer tried to pad all validation predictions to match the length of this single long output, it attempted to allocate a massive tensor (~31 GB), causing the OOM crash.
- The Solution: To fix this, I created a GenerationConfig object to explicitly control the generation behavior during the evaluation phase. By setting max_new_tokens=128, I provide a generous limit for the model to generate its short bounding box response, while preventing the runaway generation that caused the memory spike.
This configuration is passed to the SFTTrainer to ensure all mid-training evaluations are memory-safe.
from transformers import GenerationConfig
generation_config = GenerationConfig(
max_new_tokens=128, # or 256 if you prefer
do_sample=False,
num_beams=1,
pad_token_id=processor.tokenizer.pad_token_id,
eos_token_id=processor.tokenizer.eos_token_id,
)
model.generation_config = generation_config # make it the default
# print(hasattr(model, "peft_config"))
class VLMDataCollator:
"""
Collate function for Qwen2-VL fine-tuning.
- Converts a mapped dataset example (with `messages` and `image`) into the
multimodal ChatML structure that Qwen expects: the user turn contains both the
image and the prompt text, and assistant turns carry plain text.
- Uses the Qwen processor to tokenize text and encode images, returning padded
batches with `input_ids`, `pixel_values`, and other multimodal features.
- Optionally masks the prompt tokens in `labels` (via `mask_prompt=True`) so that
      the loss is computed only on the assistant's answer. This lets you switch between
completion loss and full-text loss without redefining the collator.
"""
def __init__(self, processor, mask_prompt=True):
self.processor = processor
self.mask_prompt = mask_prompt
self.pad_id = processor.tokenizer.pad_token_id
def _to_multimodal_chat(self, conversation, image):
formatted = []
for message in conversation:
role = message.get('role')
content = message.get('content')
if isinstance(content, list) and content and isinstance(content[0], dict) and 'type' in content[0]:
formatted.append(message)
continue
text = content if isinstance(content, str) else ''
if role == 'user':
formatted.append({
'role': 'user',
'content': [
{'type': 'image', 'image': image},
{'type': 'text', 'text': text.replace('<|image_1|>', '').strip()},
],
})
else:
formatted.append({
'role': role,
'content': [{'type': 'text', 'text': text}],
})
return formatted
def __call__(self, features):
processed_conversations = []
prompts = []
image_inputs = []
for feature in features:
conversation = feature['messages']
image = feature['image']
multimodal = self._to_multimodal_chat(conversation, image)
processed_conversations.append(multimodal)
prompts.append(
self.processor.apply_chat_template(
multimodal, tokenize=False, add_generation_prompt=False
)
)
image_inputs.append(process_vision_info(multimodal)[0])
batch = self.processor(
text=prompts,
images=image_inputs,
return_tensors='pt',
padding=True,
)
batch['pixel_values'] = batch['pixel_values'].to(torch.bfloat16)
labels = batch['input_ids'].clone()
for idx, conversation in enumerate(processed_conversations):
prompt_only = conversation[:-1]
if not prompt_only:
continue
prompt_text = self.processor.apply_chat_template(
prompt_only, tokenize=False, add_generation_prompt=True
)
prompt_ids = self.processor.tokenizer(
prompt_text,
add_special_tokens=False,
return_attention_mask=False,
).input_ids
if self.mask_prompt:
labels[idx, : len(prompt_ids)] = -100
if self.pad_id is not None:
labels[batch['input_ids'] == self.pad_id] = -100
batch['labels'] = labels
return batch
Experiment Descriptions & Hypotheses¶
➡️ Experiment 1a: Completion-Only Loss (Primary)
- Description: LoRA on the LLM only, with loss calculated just on the assistant's answer.
- Hypothesis: This will be the most effective method, as the model's learning is focused purely on the task of generating correct bounding box strings.
➡️ Experiment 1b: Full-Text Loss (Sanity Check)
- Description: LoRA on the LLM only, but the loss is calculated over the entire conversation, including the prompt.
- Hypothesis: This will perform worse than 1a, as the model will waste capacity learning to predict the prompt it was already given.
➡️ Experiment 2: Vision + Language LoRA (Advanced)
- Description: LoRA adapters are applied to both the vision encoder and the language model.
- Hypothesis: This may offer a slight improvement if the nutrition labels have distinct visual features not well-represented in the model's original pre-training data.
Training Configuration (SFTTrainer)¶
The SFTConfig is set up to balance performance and memory constraints on the A100 40GB GPU. Key choices include:
- gradient_accumulation_steps: This allows a larger effective batch size for more stable gradients without increasing VRAM.
- bf16=True: Enables automatic mixed-precision training, which speeds up computation significantly on modern GPUs.
- gradient_checkpointing=True: A memory-saving technique that trades some computation time to reduce the VRAM needed for storing activations.
🎯 LoRA Target Modules: LLM vs Vision Encoder (Qwen2-VL)¶
✅ Language Model (LLM) Layers¶
- PEFT automatically matches all layers when you use simple strings like: target_modules=["q_proj", "v_proj"]
- Matches: model.model.layers.0.self_attn.q_proj through ...layers.27.self_attn.v_proj

💡 Why these?
Research and practice show q_proj and v_proj are often the most impactful targets for LoRA in transformer attention blocks: tuning them gives roughly 90% of the performance gain with minimal overhead.
🖼️ Vision Encoder Layers¶
- Naming is different: model.visual.blocks.0.attn.qkv through ...blocks.31.attn.qkv
- Use a regex to avoid accidental matches: r"visual\.blocks\.\d+\.attn\.qkv"
- ⚠️ Avoid the bare string "qkv": it is too generic and may match unintended modules later.
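To verify a pattern before handing it to LoraConfig, it helps to scan module names against the regex first. The sketch below runs on a toy list of names mimicking Qwen2-VL's naming scheme (assumption: on a loaded model you would iterate over real module names instead):

```python
import re

# Toy module names mimicking Qwen2-VL's naming scheme (illustrative only).
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "visual.blocks.0.attn.qkv",
    "visual.merger.mlp.0",
]
pattern = re.compile(r"visual\.blocks\.\d+\.attn\.qkv")
matches = [n for n in names if pattern.search(n)]
print(matches)  # only the vision-encoder attention projection matches
```

On the real model, the equivalent check is `[n for n, _ in model.named_modules() if pattern.search(n)]`.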
# ============================================================
# 🔧 CRITICAL FIX #2: Reduce LoRA configuration for memory efficiency
# ============================================================
# Original config had:
# - r=16 (rank 16)
# - lora_alpha=32
# - 7 target modules: ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "gate_proj", "down_proj"]
#
# This consumed ~700 MB - 1 GB for LoRA adapters alone!
#
# New config (matching N's working notebook):
# - r=8 (rank 8) → 2x fewer parameters per adapter (LoRA params scale linearly with r)
# - lora_alpha=16 (proportional to r)
# - 2 target modules: ["q_proj", "v_proj"] → 3.5x fewer modules
#
# Memory impact:
# - Before: ~700 MB for LoRA + ~360 MB gradients = ~1.06 GB
# - After: ~200 MB for LoRA + ~100 MB gradients = ~0.30 GB
# - Savings: ~760 MB!
from peft import LoraConfig
peft_config = LoraConfig(
r=8,
lora_alpha=16,
target_modules=[
"q_proj",
"v_proj",
        r"visual\.blocks\.\d+\.attn\.qkv"  # ← vision encoder attention, exp2
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# model = get_peft_model(model, peft_config)
# model.print_trainable_parameters()
# ----------------------------------------------------------------------------------
# Training Configuration (`SFTConfig`)
# ----------------------------------------------------------------------------------
# The configuration below is optimized for a single A100 40GB GPU and implements
# an early stopping strategy by saving the model at each epoch and loading the
# best one at the end, based on the validation set's Mean IoU.
# ----------------------------------------------------------------------------------
# Memory impact of gradient checkpointing:
# - Without: ~9 GB for activations
# - With: ~0.6-1.0 GB for activations
# - Savings: ~8 GB!
#
# Trade-off: ~20% slower training, but makes training POSSIBLE!
# EXPERIMENT_NAME = 'exp1a'
# EXPERIMENT_NAME = 'exp1b'
EXPERIMENT_NAME = 'exp2'
exp_tag = EXPERIMENT_NAME
sft_config = SFTConfig(
output_dir=f"qwen2-7b-nutrition-a100_{exp_tag}",
num_train_epochs=7,
per_device_train_batch_size=1,
per_device_eval_batch_size=1,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
bf16=True,
tf32=True,
optim="paged_adamw_8bit",
learning_rate=1e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.03,
weight_decay=0.01,
max_grad_norm=0.3,
save_strategy="epoch",
load_best_model_at_end=False, # set to False for now
logging_steps=10,
report_to="none",
dataset_kwargs={"skip_prepare_dataset": True},
remove_unused_columns=False,
)
# === Manual Evaluation Strategy ===
# We disable automatic evaluation to prevent OOM errors and will
# evaluate all saved checkpoints manually after training.
sft_config.eval_strategy = "no" #"epoch"
sft_config.load_best_model_at_end = False # having issues with in loop eval text generation control
# sft_config.metric_for_best_model = "eval_mean_gt_iou"
# sft_config.greater_is_better = True
sft_config.generation_max_length = 128
print("✅ SFTConfig created and optimized for single A100 with early stopping.")
print(f" Max epochs: {sft_config.num_train_epochs}")
print(f" Best model will be selected based on: {sft_config.metric_for_best_model}")
✅ SFTConfig created and optimized for single A100 with early stopping. Max epochs: 7 Best model will be selected based on: None
# MEMORY CHECK CELL
if 'model' not in globals():
raise RuntimeError('Load the model before running this diagnostics cell.')
try:
collator = vlm_collator
except NameError:
collator = VLMDataCollator(processor)
vlm_collator = collator
if 'batch_debug' not in locals():
sample = train_dataset[0]
batch_debug = collator([sample])
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
bytes_per_param = 2 # assume bfloat16 params/checkpoints
param_mem_gb = total_params * bytes_per_param / 1024**3
trainable_mem_gb = trainable_params * bytes_per_param / 1024**3
seq_len = batch_debug['input_ids'].shape[-1]
hidden_size = model.config.text_config.hidden_size
bytes_per_activation = 2 # bfloat16 activations
activation_mem_gb = (seq_len * hidden_size * bytes_per_activation *
sft_config.per_device_train_batch_size) / 1024**3
free_mem, total_mem = torch.cuda.mem_get_info()
free_mem_gb, total_mem_gb = free_mem / 1024**3, total_mem / 1024**3
print(f'Total params: {total_params:,} (~{param_mem_gb:.2f} GB)')
print(f'Trainable params: {trainable_params:,} (~{trainable_mem_gb:.2f} GB)')
print(f'Sequence length (debug batch): {seq_len}')
print(f'Hidden size: {hidden_size}')
print(f'Per-microbatch activation estimate: ~{activation_mem_gb:.2f} GB')
print(f'Gradient accumulation steps: {sft_config.gradient_accumulation_steps}')
print(f'Effective batch size: {sft_config.gradient_accumulation_steps * sft_config.per_device_train_batch_size}')
print(f'CUDA memory (free/total): {free_mem_gb:.2f} / {total_mem_gb:.2f} GB')
Total params: 4,691,876,352 (~8.74 GB) Trainable params: 1,091,870,720 (~2.03 GB) Sequence length (debug batch): 1136 Hidden size: 3584 Per-microbatch activation estimate: ~0.01 GB Gradient accumulation steps: 4 Effective batch size: 4 CUDA memory (free/total): 40.50 / 79.25 GB
mask_prompt = EXPERIMENT_NAME != 'exp1b'  # True for exp1a and exp2; exp1b uses full-text loss
vlm_collator = VLMDataCollator(processor, mask_prompt=mask_prompt)
print(f'✅ Collator ready for {EXPERIMENT_NAME} (mask_prompt={mask_prompt})')
✅ Collator ready for exp2 (mask_prompt=True)
# can be used in training loop eval
def compute_metrics(eval_pred):
predictions, labels = eval_pred
# Decode predictions
decoded_preds = processor.batch_decode(predictions, skip_special_tokens=True)
# Replace -100 with pad token id in a copy of labels, then decode
labels_copy = labels.copy()
labels_copy[labels_copy == -100] = processor.tokenizer.pad_token_id
decoded_labels = processor.batch_decode(labels_copy, skip_special_tokens=True)
total_iou = 0.0
tp = fp = fn = 0
total_gt = 0
iou_threshold = 0.5
for pred_text, label_text in zip(decoded_preds, decoded_labels):
pred_boxes = parse_bounding_boxes(pred_text) # [x_min, y_min, x_max, y_max]
gt_boxes = parse_bounding_boxes(label_text) # same format now
if not gt_boxes and not pred_boxes:
continue
if not pred_boxes:
fn += len(gt_boxes)
total_gt += len(gt_boxes)
continue
if not gt_boxes:
fp += len(pred_boxes)
continue
pred_tensor = torch.tensor(pred_boxes, dtype=torch.float32)
gt_tensor = torch.tensor(gt_boxes, dtype=torch.float32)
iou_matrix = box_iou(pred_tensor, gt_tensor)
if iou_matrix.numel() == 0:
fn += len(gt_boxes)
fp += len(pred_boxes)
total_gt += len(gt_boxes)
continue
# greedy match
all_pairs = [
(iou_matrix[p, g].item(), p, g)
for p in range(iou_matrix.shape[0])
for g in range(iou_matrix.shape[1])
]
all_pairs.sort(reverse=True)
matched_preds = set()
matched_gts = set()
matched_iou_sum = 0.0
for iou, p, g in all_pairs:
if iou < iou_threshold:
break
if p in matched_preds or g in matched_gts:
continue
matched_preds.add(p)
matched_gts.add(g)
matched_iou_sum += iou
tp += len(matched_preds)
fp += len(pred_boxes) - len(matched_preds)
fn += len(gt_boxes) - len(matched_preds)
total_iou += matched_iou_sum
total_gt += len(gt_boxes)
mean_iou = total_iou / total_gt if total_gt else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
return {
"mean_gt_iou": mean_iou,
"precision": precision,
"recall": recall,
"f1": f1,
}
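The greedy matching inside compute_metrics can be illustrated in isolation on a toy IoU matrix (rows = predictions, columns = ground-truth boxes): pairs are consumed best-first, each box is used at most once, and pairs below the threshold are discarded.

```python
def greedy_match(iou_matrix, iou_threshold=0.5):
    """Greedy one-to-one matching: highest-IoU pairs first, stop below threshold.
    iou_matrix is a plain list of lists; returns (pred_idx, gt_idx, iou) tuples."""
    if not iou_matrix:
        return []
    pairs = sorted(
        ((iou_matrix[p][g], p, g)
         for p in range(len(iou_matrix))
         for g in range(len(iou_matrix[0]))),
        reverse=True,
    )
    matched_p, matched_g, matches = set(), set(), []
    for iou, p, g in pairs:
        if iou < iou_threshold:
            break  # remaining pairs are all weaker; stop
        if p in matched_p or g in matched_g:
            continue  # each box may be matched at most once
        matched_p.add(p)
        matched_g.add(g)
        matches.append((p, g, iou))
    return matches

# Pred 0 overlaps both GTs; pred 1 only GT 1. Greedy takes (0,0) first,
# so pred 1 is left to claim GT 1.
print(greedy_match([[0.9, 0.6], [0.0, 0.7]]))  # [(0, 0, 0.9), (1, 1, 0.7)]
```

Note greedy matching is not globally optimal (Hungarian matching would be), but it is simple and standard for detection metrics at a fixed threshold.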
from trl import SFTTrainer
trainer = SFTTrainer(
model=model,
args=sft_config,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=vlm_collator,
peft_config=peft_config,
compute_metrics=compute_metrics,
)
trainer.model.print_trainable_parameters() # just to confirm LoRA is live
train_output = trainer.train()
# print(train_output)
Evaluation Setup: SDPA Attention Implementation¶
For all evaluations (baseline and fine-tuned checkpoints), I used SDPA (Scaled Dot Product Attention) which is PyTorch's native attention implementation:
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct",
quantization_config=bnb_config,
device_map="auto",
attn_implementation="sdpa", # Use PyTorch SDPA
)
Why SDPA instead of Flash Attention?
- Compatibility: Works reliably with 4-bit quantization + bfloat16
- Stability: No kernel fallback issues during inference
- Consistency: Same attention mechanism across all evaluations (baseline + experiments)
- Sufficient Performance: Evaluation is not bottlenecked by attention (model loading takes longer)
Training vs Evaluation:
- Training: Used default attention (Flash Attention when available) for maximum memory efficiency
- Evaluation: Explicitly specified SDPA for consistent, stable inference
This ensures apples-to-apples comparison across all checkpoints and the baseline model.
def downsize_images(sample):
"""Only resize images, keep everything else intact"""
max_long_side = 1024
img = sample["image"].copy()
img.thumbnail((max_long_side, max_long_side), Image.Resampling.LANCZOS)
sample["image"] = img
return sample
# Apply downsizing to RAW dataset (this keeps "objects" field)
dataset_test_downsized = [downsize_images(sample) for sample in dataset_test_raw]
Checkpoint Evaluation¶
# ============================================================================
# CHECKPOINT EVALUATION - Find Best Model Using evaluate_vlm
# ============================================================================
"""
This cell evaluates all training checkpoints to find the best performing model.
WHY evaluate_vlm():
- Ensures ALL ground truth boxes are counted (matched or not)
- Unmatched GT boxes contribute 0 to IoU (included in denominator)
Example: If image has 3 GT boxes but model predicts 1:
- 1 matched box contributes its IoU (e.g., 0.8)
- 2 unmatched boxes contribute 0.0
- mean_iou = 0.8 / 3 = 0.267
"""
EXPERIMENT_NAME = 'exp2' # CHANGE THIS: 'exp1a', 'exp1b', or 'exp2'
output_dir = f"qwen2-7b-nutrition-a100_{EXPERIMENT_NAME}"
print("="*80)
print(f"🔍 Evaluating {EXPERIMENT_NAME} checkpoints with evaluate_vlm")
print("="*80)
# ============================================================================
# Load processor (shared across all checkpoints)
# ============================================================================
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# Set MAX_PIXELS
vision_process.MAX_PIXELS = 600 * 28 * 28
print(f"✅ MAX_PIXELS set to: {vision_process.MAX_PIXELS:,} pixels")
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
# ============================================================================
# Step 1: Find all checkpoint directories
# ============================================================================
all_items = os.listdir(output_dir)
def extract_checkpoint_number(checkpoint_name):
"""
Extract step number from checkpoint name.
Args:
checkpoint_name: String like 'checkpoint-271'
Returns:
int: Step number (271) or None if not a valid checkpoint
"""
try:
return int(checkpoint_name.split('-')[1])
except (IndexError, ValueError):
return None
# Filter only valid checkpoints and sort numerically
valid_checkpoints = [d for d in all_items if extract_checkpoint_number(d) is not None]
checkpoints = sorted(valid_checkpoints, key=extract_checkpoint_number)
print(f"\n📦 Found {len(checkpoints)} checkpoints to evaluate")
print(f" Range: {checkpoints[0]} to {checkpoints[-1]}")
# ============================================================================
# Step 2: Evaluate each checkpoint
# ============================================================================
checkpoint_results = []
for i, checkpoint in enumerate(checkpoints, 1):
checkpoint_path = os.path.join(output_dir, checkpoint)
step = extract_checkpoint_number(checkpoint)
print(f"\n[{i}/{len(checkpoints)}] Evaluating {checkpoint} (step {step})...")
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
    # Load the base model with the same quantization config as the baseline
    # (apples-to-apples comparison across checkpoints)
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct",
torch_dtype=torch.bfloat16,
quantization_config=bnb_config,
device_map="auto",
attn_implementation="sdpa",
)
# Load LoRA adapter weights
model = PeftModel.from_pretrained(base_model, checkpoint_path)
# Evaluate with evaluate_vlm (consistent with baseline)
metrics = evaluate_vlm(
model,
processor,
dataset_test_downsized, # Same test set as experiments
max_samples=None, # Evaluate all 123 samples
iou_threshold=0.5 # Standard threshold for detection
)
# Store results
checkpoint_results.append({
'checkpoint': checkpoint,
'checkpoint_step': step,
'mean_gt_iou': metrics['mean_gt_iou'], # Mean IoU over ALL GT boxes
'precision@0.5': metrics['precision@0.50'], # TP / (TP + FP)
'recall@0.5': metrics['recall@0.50'], # TP / (TP + FN)
'f1@0.5': metrics['f1@0.50'], # Harmonic mean
})
print(f" Mean GT IoU: {metrics['mean_gt_iou']:.3f}")
print(f" Precision: {metrics['precision@0.50']:.3f}")
print(f" Recall: {metrics['recall@0.50']:.3f}")
print(f" F1 Score: {metrics['f1@0.50']:.3f}")
# Clean up GPU memory
del model
del base_model
torch.cuda.empty_cache()
# ============================================================================
# Step 3: Find best checkpoint and save results
# ============================================================================
df = pd.DataFrame(checkpoint_results)
df = df.sort_values('checkpoint_step')
# Save detailed results
results_path = os.path.join(output_dir, f'{EXPERIMENT_NAME}_checkpoint_results.csv')
df.to_csv(results_path, index=False)
print(f"\n💾 Saved results to: {results_path}")
# Find best checkpoint by mean GT IoU
best_idx = df['mean_gt_iou'].idxmax()
best_checkpoint = df.loc[best_idx, 'checkpoint']
best_iou = df.loc[best_idx, 'mean_gt_iou']
best_f1 = df.loc[best_idx, 'f1@0.5']
best_step = df.loc[best_idx, 'checkpoint_step']
print("\n" + "="*80)
print(f"π BEST CHECKPOINT: {best_checkpoint}")
print("="*80)
print(f" Step: {best_step}")
print(f" Mean GT IoU: {best_iou:.3f}")
print(f" F1 Score: {best_f1:.3f}")
print("="*80)
# Display all results in compact format
print(f"\nπ All Checkpoint Results:")
print(df[['checkpoint_step', 'mean_gt_iou', 'f1@0.5']].to_string(index=False))
print(f"\nβ
Checkpoint evaluation complete for {EXPERIMENT_NAME}")
This cell evaluates all training checkpoints to find the best performing model.

WHY evaluate_vlm():
- Ensures ALL ground-truth boxes are counted (matched or not)
- Unmatched GT boxes contribute 0 to IoU (they stay in the denominator)

Example: if an image has 3 GT boxes but the model predicts 1:
- 1 matched box contributes its IoU (e.g., 0.8)
- 2 unmatched boxes contribute 0.0
- mean_iou = 0.8 / 3 = 0.267
================================================================================
Evaluating exp2 checkpoints with evaluate_vlm
================================================================================
MAX_PIXELS set to: 470,400 pixels
Found 7 checkpoints to evaluate
  Range: checkpoint-271 to checkpoint-1897
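The denominator rule above can be checked with a toy computation (all numbers hypothetical, chosen to match the docstring example):

```python
# Mean GT IoU counts every ground-truth box, matched or not (toy numbers)
matched_ious = [0.8]   # best IoU for the one GT box a prediction matched
n_gt_boxes = 3         # total GT boxes; the 2 unmatched boxes contribute 0.0
mean_gt_iou = sum(matched_ious) / n_gt_boxes
print(round(mean_gt_iou, 3))  # 0.267
```

Dividing by the number of GT boxes (rather than the number of matches) is what penalizes missed detections.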
def calculate_iou(pred_boxes, gt_boxes):
"""
Calculate mean IoU between predicted and ground truth boxes
Args:
pred_boxes: List of [x_min, y_min, x_max, y_max] (normalized, corner format)
gt_boxes: List of [y_min, x_min, y_max, x_max] (normalized, corner format)
"""
if not pred_boxes or not gt_boxes:
return 0.0
ious = []
for gt_box in gt_boxes:
# GT format: [y_min, x_min, y_max, x_max] -> convert to [x_min, y_min, x_max, y_max]
gt_y_min, gt_x_min, gt_y_max, gt_x_max = gt_box
best_iou = 0.0
for pred_box in pred_boxes:
# Pred format: [x_min, y_min, x_max, y_max]
pred_x_min, pred_y_min, pred_x_max, pred_y_max = pred_box
# Calculate intersection (both now in same coordinate system)
x_left = max(gt_x_min, pred_x_min)
y_top = max(gt_y_min, pred_y_min)
x_right = min(gt_x_max, pred_x_max)
y_bottom = min(gt_y_max, pred_y_max)
if x_right > x_left and y_bottom > y_top:
intersection = (x_right - x_left) * (y_bottom - y_top)
# Calculate areas
gt_area = (gt_x_max - gt_x_min) * (gt_y_max - gt_y_min)
pred_area = (pred_x_max - pred_x_min) * (pred_y_max - pred_y_min)
union = gt_area + pred_area - intersection
iou = intersection / union if union > 0 else 0.0
best_iou = max(best_iou, iou)
ious.append(best_iou)
return sum(ious) / len(ious) if ious else 0.0
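To sanity-check the geometry, here is a minimal standalone single-pair IoU (a sketch, not part of the notebook's code; `iou_xyxy` takes both boxes in the same [x_min, y_min, x_max, y_max] normalized format, so no axis swap is needed):

```python
def iou_xyxy(a, b):
    """IoU of two boxes, both [x_min, y_min, x_max, y_max], normalized."""
    x_left, y_top = max(a[0], b[0]), max(a[1], b[1])
    x_right, y_bottom = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes get no intersection area
    inter = max(0.0, x_right - x_left) * max(0.0, y_bottom - y_top)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

print(iou_xyxy([0.0, 0.0, 0.5, 0.5], [0.0, 0.0, 1.0, 1.0]))  # 0.25
print(iou_xyxy([0.2, 0.2, 0.4, 0.4], [0.5, 0.5, 0.9, 0.9]))  # 0.0 (disjoint)
```

A quarter-size box fully inside the unit box has intersection 0.25 and union 1.0, hence IoU 0.25.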
def get_sample_ious(model, processor, dataset, max_samples=None):
"""
Calculate IoU for each sample individually.
This function runs inference on each test sample and calculates the IoU
between predicted and ground truth boxes. Used for:
- Distribution analysis
- Failure case identification
- Individual sample visualization
Args:
model: Fine-tuned or baseline model
processor: AutoProcessor for the model
dataset: Test dataset (downsized)
max_samples: Optional limit on samples to process
Returns:
DataFrame with columns: sample_idx, image_id, iou, prediction, pred_boxes, gt_boxes
"""
sample_results = []
samples = dataset.select(range(max_samples)) if max_samples else dataset  # slicing a HF Dataset returns a dict of columns; select() keeps row-wise iteration
for idx, example in enumerate(samples):
response = run_inference(example, model=model, processor=processor)
pred_boxes = parse_bounding_boxes(response)
gt_boxes = example["objects"]["bbox"]
iou = calculate_iou(pred_boxes, gt_boxes)
sample_results.append({
'sample_idx': idx,
'image_id': example.get('image_id', f'sample_{idx}'),
'iou': iou,
'prediction': response,
'pred_boxes': pred_boxes,
'gt_boxes': gt_boxes
})
if (idx + 1) % 20 == 0:
print(f" Processed {idx + 1}/{len(samples)} samples...")
return pd.DataFrame(sample_results)
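The per-sample IoUs returned here feed a quartile-stratified pick of visualization samples in the analysis cell below. The slicing logic, sketched on a hypothetical ascending-sorted list:

```python
# Quartile slices over IoUs sorted ascending (hypothetical values)
ious = [0.05, 0.20, 0.45, 0.60, 0.72, 0.80, 0.90, 0.97]
n = len(ious)
worst  = ious[: n // 4]             # bottom quartile (0-25%)
middle = ious[n // 4 : 3 * n // 4]  # middle quartiles (25-75%)
best   = ious[3 * n // 4 :]         # top quartile (75-100%)
print(worst, best)  # [0.05, 0.2] [0.9, 0.97]
```

Sampling from each stratum (worst for failure analysis, middle at random, best for showcase) gives a more honest picture than cherry-picking either tail.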
# ============================================================
# COMPLETE EXPERIMENT ANALYSIS - ALL IN ONE CELL
# IMPORTANT: Change EXPERIMENT_NAME for each run!
# ============================================================
# SET THIS - Change for each experiment: 'exp1a', 'exp1b', 'exp2'
# EXPERIMENT_NAME = 'exp1b'
EXPERIMENT_NAME = 'exp2'
output_dir = f"qwen2-7b-nutrition-a100_{EXPERIMENT_NAME}"
base_model_id = 'Qwen/Qwen2-VL-7B-Instruct'
print(f"\n{'='*70}")
print(f"ANALYZING EXPERIMENT: {EXPERIMENT_NAME}")
print(f"{'='*70}\n")
# ============================================================
# SETUP
# ============================================================
# Set MAX_PIXELS
vision_process.MAX_PIXELS = 600 * 28 * 28
print(f"β
MAX_PIXELS set to: {vision_process.MAX_PIXELS:,} pixels")
# Load processor
processor = AutoProcessor.from_pretrained(base_model_id, trust_remote_code=True)
# Force the math SDPA backend (disable flash and memory-efficient kernels)
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
# ============================================================
# STEP 1: Load checkpoint evaluation results
# ============================================================
results_path = os.path.join(output_dir, f'{EXPERIMENT_NAME}_checkpoint_results.csv')
# Check if results exist
if not os.path.exists(results_path):
print(f"β Results file not found: {results_path}")
print(f" Run checkpoint evaluation first!")
raise FileNotFoundError(results_path)
df = pd.read_csv(results_path)
df['checkpoint_step'] = df['checkpoint'].str.extract(r'(\d+)', expand=False).astype(int)
df_sorted = df.sort_values('checkpoint_step')
best_checkpoint = df.loc[df["mean_gt_iou"].idxmax(), "checkpoint"]
best_iou = df.loc[df['mean_gt_iou'].idxmax(), 'mean_gt_iou']
print(f"β
Loaded results from: {results_path}")
print(f"β
Best checkpoint: {best_checkpoint} (IoU: {best_iou:.4f})")
# ============================================================
# STEP 2: Load training history
# ============================================================
trainer_state_path = os.path.join(output_dir, best_checkpoint, "trainer_state.json")
if not os.path.exists(trainer_state_path):
print(f"β Training state not found: {trainer_state_path}")
raise FileNotFoundError(trainer_state_path)
with open(trainer_state_path) as f:
trainer_state = json.load(f)
history = pd.DataFrame(trainer_state["log_history"])
train_loss = history.loc[history["loss"].notna(), ["step", "loss"]].copy()
train_loss["epoch"] = train_loss["step"] / 271
print(f"β
Loaded training history")
# ============================================================
# STEP 3: Plot training progress
# ============================================================
print(f"\n{'='*70}")
print(f"Generating training progress plot...")
print(f"{'='*70}\n")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5))
fig.suptitle(f'{EXPERIMENT_NAME}: Training Progress', fontsize=16)
# Training loss
_= ax1.plot(train_loss["epoch"], train_loss["loss"], linewidth=2, color='#2E86AB')
_= ax1.set_xlabel('Epoch', fontsize=12)
_= ax1.set_ylabel('Training Loss (Cross Entropy)', fontsize=12)
_= ax1.set_title('Training Loss', fontsize=14)
_= ax1.grid(True, alpha=0.3)
# Validation metrics
ax2_twin = ax2.twinx()
line1 = ax2.plot(df_sorted["checkpoint_step"] / 271, df_sorted["mean_gt_iou"],
marker='o', linewidth=2, markersize=6, color='#A23B72', label='Mean IoU')
_= ax2.set_xlabel('Epoch', fontsize=12)
_= ax2.set_ylabel('Mean IoU', fontsize=12, color='#A23B72')
_= ax2.tick_params(axis='y', labelcolor='#A23B72')
line2 = ax2_twin.plot(df_sorted["checkpoint_step"] / 271, df_sorted["f1@0.5"],
marker='s', linewidth=2, markersize=6, color='#F18F01', label='F1 Score')
_= ax2_twin.set_ylabel('F1 Score', fontsize=12, color='#F18F01')
_= ax2_twin.tick_params(axis='y', labelcolor='#F18F01')
_= ax2.set_title('Validation Metrics', fontsize=14)
_= ax2.grid(True, alpha=0.3)
lines = line1 + line2
labels = [l.get_label() for l in lines]
ax2.legend(lines, labels, loc='lower right', fontsize=10)
plt.tight_layout()
plot_path = os.path.join(output_dir, f'training_validation_{EXPERIMENT_NAME}.png')
plt.savefig(plot_path, dpi=150, bbox_inches='tight')
print(f"β
Saved: {plot_path}")
plt.show()
# ============================================================
# STEP 4: Load best model for sample analysis
# ============================================================
print(f"\n{'='*70}")
print(f"Loading best model: {best_checkpoint}")
print(f"{'='*70}\n")
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
base_model = AutoModelForImageTextToText.from_pretrained(
base_model_id,
quantization_config=bnb_config,
device_map="auto",
attn_implementation="sdpa", # β This uses SDPA anyway
trust_remote_code=True,
)
best_ckpt_path = os.path.join(output_dir, best_checkpoint)
best_model = PeftModel.from_pretrained(base_model, best_ckpt_path, is_trainable=False)
_ = best_model.eval()
print(f"β
Model loaded")
# ============================================================
# STEP 5: Sample IoU analysis
# ============================================================
print(f"\n{'='*70}")
print(f"Analyzing sample-level IoUs...")
print(f"{'='*70}\n")
sample_df = get_sample_ious(best_model, processor, dataset_test_downsized)
# Save sample IoUs
sample_iou_path = os.path.join(output_dir, f'sample_ious_{EXPERIMENT_NAME}.csv')
sample_df.to_csv(sample_iou_path, index=False)
print(f"β
Saved sample IoUs: {sample_iou_path}")
# Stratified sampling
sample_df_sorted = sample_df.sort_values('iou')
bottom_quartile = sample_df_sorted.iloc[:len(sample_df)//4]
worst_samples = bottom_quartile.nsmallest(3, 'iou')['sample_idx'].tolist()
middle_quartiles = sample_df_sorted.iloc[len(sample_df)//4:3*len(sample_df)//4]
median_samples = middle_quartiles.sample(3, random_state=42)['sample_idx'].tolist()
top_quartile = sample_df_sorted.iloc[3*len(sample_df)//4:]
best_samples = top_quartile.nlargest(3, 'iou')['sample_idx'].tolist()
print(f"\nWorst samples (0-25%): {worst_samples}")
print(f" IoU: {[sample_df.loc[sample_df['sample_idx'] == i, 'iou'].values[0] for i in worst_samples]}")
print(f"\nMedian samples (25-75%): {median_samples}")
print(f" IoU: {[sample_df.loc[sample_df['sample_idx'] == i, 'iou'].values[0] for i in median_samples]}")
print(f"\nBest samples (75-100%): {best_samples}")
print(f" IoU: {[sample_df.loc[sample_df['sample_idx'] == i, 'iou'].values[0] for i in best_samples]}")
# ============================================================
# STEP 6: Visualize performance distribution
# ============================================================
print(f"\n{'='*70}")
print(f"Generating performance distribution visualization...")
print(f"{'='*70}\n")
fig, axes = plt.subplots(3, 3, figsize=(18, 18))
fig.suptitle(f'{EXPERIMENT_NAME}: Performance Distribution (Worst → Median → Best)', fontsize=16)
all_samples = worst_samples + median_samples + best_samples
sample_labels = ['Worst'] * 3 + ['Median'] * 3 + ['Best'] * 3
for idx, (sample_idx, label, ax) in enumerate(zip(all_samples, sample_labels, axes.flat)):
sample = dataset_test_downsized[sample_idx]
response = run_inference(sample, model=best_model, processor=processor, max_new_tokens=128)
image = sample["image"].copy()
draw = ImageDraw.Draw(image)
w, h = image.size
# GT (green)
for y_min, x_min, y_max, x_max in sample["objects"]["bbox"]:
draw.rectangle([(x_min * w, y_min * h), (x_max * w, y_max * h)], outline="lime", width=3)
# Pred (red)
pred_boxes = parse_bounding_boxes(response)
for x_min, y_min, x_max, y_max in pred_boxes:
draw.rectangle([(x_min * w, y_min * h), (x_max * w, y_max * h)], outline="red", width=3)
iou = sample_df.loc[sample_df['sample_idx'] == sample_idx, 'iou'].values[0]
_= ax.imshow(image)
_= ax.axis('off')
_= ax.set_title(f'{label} - IoU: {iou:.3f}', fontsize=11);
plt.tight_layout()
viz_path = os.path.join(output_dir, f'failure_analysis_{EXPERIMENT_NAME}.png')
plt.savefig(viz_path, dpi=150, bbox_inches='tight')
print(f"β
Saved: {viz_path}")
plt.show()
# ============================================================================
# PART 8: IoU Distribution Analysis (Bimodal Check)
# ============================================================================
print("\n" + "="*80)
print("π PART 8: IoU Distribution Analysis")
print("="*80)
# Plot IoU distribution
fig, axes = plt.subplots(1, 2, figsize=(16, 5));
# Histogram
axes[0].hist(sample_df['iou'], bins=30, edgecolor='black', alpha=0.7);
axes[0].axvline(sample_df['iou'].mean(), color='red', linestyle='--',
linewidth=2, label=f'Mean: {sample_df["iou"].mean():.3f}');
axes[0].axvline(sample_df['iou'].median(), color='green', linestyle='--',
linewidth=2, label=f'Median: {sample_df["iou"].median():.3f}');
axes[0].set_xlabel('IoU Score', fontsize=12);
axes[0].set_ylabel('Frequency', fontsize=12);
axes[0].set_title('IoU Distribution - Test Set', fontsize=14, fontweight='bold');
axes[0].legend(fontsize=11);
axes[0].grid(axis='y', alpha=0.3)
# Cumulative distribution
sorted_ious = np.sort(sample_df['iou'])
cumulative = np.arange(1, len(sorted_ious) + 1) / len(sorted_ious) * 100
axes[1].plot(sorted_ious, cumulative, linewidth=2);
axes[1].axhline(80, color='red', linestyle='--', alpha=0.5, label='80th percentile');
axes[1].axhline(50, color='green', linestyle='--', alpha=0.5, label='50th percentile');
axes[1].set_xlabel('IoU Score', fontsize=12);
axes[1].set_ylabel('Cumulative Percentage', fontsize=12);
axes[1].set_title('Cumulative IoU Distribution', fontsize=14, fontweight='bold');
axes[1].legend(fontsize=11)
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(output_dir, f'{EXPERIMENT_NAME}_iou_distribution.png'),
dpi=150, bbox_inches='tight')
plt.show()
# Print statistics
print(f"\nπ Distribution Statistics:")
print(f" Mean IoU: {sample_df['iou'].mean():.3f}")
print(f" Median IoU: {sample_df['iou'].median():.3f}")
print(f" Std Dev: {sample_df['iou'].std():.3f}")
print(f"\n Min IoU: {sample_df['iou'].min():.3f}")
print(f" Max IoU: {sample_df['iou'].max():.3f}")
# Quartile breakdown
q1 = sample_df['iou'].quantile(0.25)
q2 = sample_df['iou'].quantile(0.50)
q3 = sample_df['iou'].quantile(0.75)
print(f"\n 25th percentile: {q1:.3f}")
print(f" 50th percentile: {q2:.3f}")
print(f" 75th percentile: {q3:.3f}")
# Performance buckets
excellent = (sample_df['iou'] >= 0.8).sum()
good = ((sample_df['iou'] >= 0.6) & (sample_df['iou'] < 0.8)).sum()
poor = ((sample_df['iou'] >= 0.3) & (sample_df['iou'] < 0.6)).sum()
failures = (sample_df['iou'] < 0.3).sum()
total = len(sample_df)
print(f"\n Performance Buckets:")
print(f" Excellent (β₯0.8): {excellent:3d} ({excellent/total*100:.1f}%)")
print(f" Good (0.6-0.8): {good:3d} ({good/total*100:.1f}%)")
print(f" Poor (0.3-0.6): {poor:3d} ({poor/total*100:.1f}%)")
print(f" Failures (<0.3): {failures:3d} ({failures/total*100:.1f}%)")
# ============================================================================
# PART 9: Save All Predictions as Images
# ============================================================================
print("\n" + "="*80)
print("πΎ PART 9: Saving All Predictions as PNGs")
print("="*80)
output_viz_dir = os.path.join(output_dir, 'all_predictions')
os.makedirs(output_viz_dir, exist_ok=True)
print(f"\nSaving {len(sample_df)} prediction visualizations...")
from PIL import ImageDraw
for idx, row in sample_df.iterrows():
sample = dataset_test_downsized[row['sample_idx']]
# Use the working visualization approach
image = sample["image"].copy()
draw = ImageDraw.Draw(image)
w, h = image.size
# Ground truth boxes (normalized [ymin, xmin, ymax, xmax])
for y_min, x_min, y_max, x_max in sample["objects"]["bbox"]:
draw.rectangle(
[(x_min * w, y_min * h), (x_max * w, y_max * h)],
outline="lime",
width=4,
);
# Predicted boxes (from saved prediction text)
pred_boxes = parse_bounding_boxes(row['prediction'])
for x_min, y_min, x_max, y_max in pred_boxes:
_= draw.rectangle(
[(x_min * w, y_min * h), (x_max * w, y_max * h)],
outline="red",
width=4,
);
# Create matplotlib figure to save with title
fig, ax = plt.subplots(figsize=(10, 8))
_= ax.imshow(image);
_= ax.set_title(f"IoU: {row['iou']:.3f} | Image ID: {row['image_id']}",
fontsize=14, fontweight='bold');
_= ax.axis('off');
# Add legend
handles = [
plt.Line2D([0], [0], color='lime', linewidth=3, label='Ground Truth'),
plt.Line2D([0], [0], color='red', linewidth=3, label='Prediction')
]
_= ax.legend(handles=handles, loc='upper right', fontsize=10);
# Save
filename = f"{row['iou']:.3f}_{row['image_id']}.png"
_=plt.savefig(os.path.join(output_viz_dir, filename),
bbox_inches='tight', dpi=100);
plt.close()
# Don't use plt.show() - it causes hanging!
print(f"β
Saved {len(sample_df)} images to: {output_viz_dir}")
print(f" Files sorted by IoU (worst to best)")
# ============================================================================
# SUMMARY
# ============================================================================
print("\n" + "="*80)
print(f"β
{EXPERIMENT_NAME.upper()} ANALYSIS COMPLETE!")
print("="*80)
print(f"\nπ All outputs saved to: {output_dir}/")
print(f" β’ Training plot: {EXPERIMENT_NAME}_training_plot.png")
print(f" β’ Failure cases: {EXPERIMENT_NAME}_failure_cases.png")
print(f" β’ IoU distribution: {EXPERIMENT_NAME}_iou_distribution.png")
print(f" β’ Sample-level results: {EXPERIMENT_NAME}_sample_results.csv")
print(f" β’ All predictions: all_predictions/ ({len(sample_df)} images)")
print("\n" + "="*80)
# ============================================================
# CLEANUP
# ============================================================
del best_model, base_model
gc.collect()
torch.cuda.empty_cache()
print(f"\n{'='*70}")
print(f"β
ANALYSIS COMPLETE FOR {EXPERIMENT_NAME}")
print(f"Files saved:")
print(f" - {plot_path}")
print(f" - {viz_path}")
print(f" - {sample_iou_path}")
print(f"{'='*70}\n")
======================================================================
ANALYZING EXPERIMENT: exp2
======================================================================
MAX_PIXELS set to: 470,400 pixels
(Hugging Face Hub download progress for processor/tokenizer files omitted)
Loaded results from: qwen2-7b-nutrition-a100_exp2/exp2_checkpoint_results.csv
Best checkpoint: checkpoint-1626 (IoU: 0.7476)
Loaded training history
======================================================================
Generating training progress plot...
======================================================================
Saved: qwen2-7b-nutrition-a100_exp2/training_validation_exp2.png
====================================================================== Loading best model: checkpoint-1626 ======================================================================
(model shard download and checkpoint loading progress omitted)
Model loaded
======================================================================
Analyzing sample-level IoUs...
======================================================================
 Processed 20/123 samples...
 Processed 40/123 samples...
 Processed 60/123 samples...
 Processed 80/123 samples...
 Processed 100/123 samples...
 Processed 120/123 samples...
Saved sample IoUs: qwen2-7b-nutrition-a100_exp2/sample_ious_exp2.csv

Worst samples (0-25%): [22, 37, 35]
 IoU: [0.0000, 0.0324, 0.1780]

Median samples (25-75%): [80, 16, 102]
 IoU: [0.9266, 0.9340, 0.6447]

Best samples (75-100%): [75, 6, 83]
 IoU: [1.0000, 0.9958, 0.9843]

======================================================================
Generating performance distribution visualization...
======================================================================
Saved: qwen2-7b-nutrition-a100_exp2/failure_analysis_exp2.png
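The performance buckets used in the distribution analysis (≥0.8 excellent, 0.6-0.8 good, 0.3-0.6 poor, <0.3 failure) can also be expressed as a small reusable helper. This is a sketch; `iou_bucket` is not part of the notebook's code:

```python
from bisect import bisect_right

def iou_bucket(iou, edges=(0.3, 0.6, 0.8),
               names=('Failure', 'Poor', 'Good', 'Excellent')):
    # Left-closed buckets mirroring the report: <0.3, [0.3, 0.6), [0.6, 0.8), >=0.8
    return names[bisect_right(edges, iou)]

print(iou_bucket(0.85))  # Excellent
print(iou_bucket(0.45))  # Poor
```

Using `bisect_right` keeps the threshold logic in one place, so changing a cutoff cannot leave the four boolean masks inconsistent.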